Abstract

This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free
index policy for stochastic bandit problems. We prove two distinct results: first,
for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret
bound than UCB and its variants; second, in the special case of Bernoulli rewards, it
reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations
of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded)
rewards, including those generated from exponential families of distributions. A large-scale
numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned,
UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short
time horizons. KL-UCB is also the only method that always performs better than the basic
UCB policy. Our regret bounds rely on deviation results of independent interest, which are
stated and proved in the Appendix. As a by-product, we also obtain an improved regret
bound for the standard UCB algorithm.

1 Introduction 

The multi-armed bandit problem is a simple, archetypal setting of reinforcement learning, where an
agent facing a slot machine with several arms tries to maximize her profit by a proper choice of arm
draws. In the stochastic^1 bandit problem, the agent sequentially chooses, for $t = 1, 2, \dots, n$, an arm
$A_t \in \{1, \dots, K\}$, and receives a reward $X_t$ such that, conditionally on the arm choices $A_1, A_2, \dots$,
the rewards are independent and identically distributed, with expectations $\mu_{A_1}, \mu_{A_2}, \dots$. Her policy
is the (possibly randomized) decision rule that, to every sequence of past observations
$(A_1, X_1, \dots, A_{t-1}, X_{t-1})$, associates her next choice $A_t$. The best choice is any arm $a^*$ with
maximal expected reward $\mu_{a^*}$. The performance of her policy can be measured by the regret $R_n$,
defined as the difference between the rewards she accumulates up to the horizon $t = n$, and the
rewards that she would have accumulated during the same period, had she known from the beginning
which arm had the highest expected reward.

The agent typically faces an "exploration versus exploitation dilemma": at time $t$, she can take
advantage of the information she has gathered, by choosing the so-far best performing arm, but
she has to consider the possibility that the other arms are actually under-rated and she must play
all of them sufficiently often. Since the work of Gittins (1979) in the 1970s, this problem has raised
much interest and several variants, solutions and extensions have been proposed; see Even-Dar et al.
(2002) and references therein.

Two families of bandit settings can be distinguished: in the first family, the distribution of $X_t$
given $A_t = a$ is assumed to belong to a family $\{p_\theta, \theta \in \Theta_a\}$ of probability distributions. In a
particular parametric framework, Lai and Robbins (1985) proved a lower bound on the performance
of any policy, and determined optimal policies. This framework was extended to multi-parameter
models by Burnetas and Katehakis (1997), who showed that the number of draws up to time $n$,
$N_a(n)$, of any sub-optimal arm $a$ is lower-bounded by

$$N_a(n) \geq \left( \frac{1}{\inf_{\theta \in \Theta_a : E[p_\theta] > \mu_{a^*}} KL(p_{\theta_a}, p_\theta)} + o(1) \right) \log(n), \qquad (1)$$



^1 Another interesting setting is the adversarial bandit problem, where the rewards are not stochastic but
chosen by an opponent; this setting is not the subject of this paper.



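where KL denotes the Kullback-Leibler divergence and $E[p_\theta]$ is the expectation under $p_\theta$; hence,
the regret is lower-bounded as follows:

$$\liminf_{n \to \infty} \frac{E[R_n]}{\log(n)} \geq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{\inf_{\theta \in \Theta_a : E[p_\theta] > \mu_{a^*}} KL(p_{\theta_a}, p_\theta)}. \qquad (2)$$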

Recently, Honda and Takemura (2010) proposed an algorithm called Deterministic Minimum Empirical
Divergence (DMED), which they proved to be first-order optimal. This algorithm, which
maintains a list of arms that are close enough to the best one (and which thus must be played), is
inspired by large deviations ideas and relies on the availability of the rate function associated to the
reward distribution.

In the second family of bandit problems, the rewards are only assumed to be bounded (say,
between 0 and 1), and policies rely directly on the estimates of the expected rewards for each
arm. The KL-UCB algorithm considered in this paper is primarily meant to address this second,
non-parametric, setting. We will nonetheless show that KL-UCB also matches the lower bound of
Burnetas and Katehakis (1997) in the binary case and that it can be extended to a larger class of
parametric bandit problems.

Among all the bandit policies that have been proposed, a particular family has raised a strong
interest, after Gittins (1979) proved that (in the Bayesian setting he considered) optimal policies
could be chosen in the following very special form: compute for each arm a dynamic allocation index
(which only depends on the draws of this arm), and simply choose an arm with maximal index.
These policies compute not merely an estimate of the expected rewards, but rather an upper-confidence
bound (UCB), and the agent's choice is an arm with highest UCB. This approach is sometimes called
"optimistic", as the agent acts as if, at each instant, the expected rewards were equal to the highest
possible values that are compatible with her past observations. Auer et al. (2002), following Agrawal
(1995), proposed and analyzed two variants, UCB1 (usually called simply UCB in later works) and
UCB2, for which they provided regret bounds. UCB is an online, horizon-free procedure for which
Auer et al. (2002) prove that there exists a constant $C$ such that

$$E[R_n] \leq \sum_{a : \mu_a < \mu_{a^*}} \frac{8 \log(n)}{\mu_{a^*} - \mu_a} + C.$$

The UCB2 variant relies on a parameter $\alpha$ that needs to be tuned, depending in particular on the
horizon, and satisfies the tighter regret bound

$$E[R_n] \leq \sum_{a : \mu_a < \mu_{a^*}} \frac{(1 + \epsilon(\alpha)) \log(n)}{2 (\mu_{a^*} - \mu_a)} + C(\alpha),$$

where $\epsilon(\alpha) > 0$ is a constant that can get arbitrarily small when $\alpha$ is small, at the expense of an
increased value of the constant $C(\alpha)$. The constant 1/2 in front of the factor $\log(n)/(\mu_{a^*} - \mu_a)$
cannot be improved. We show in Proposition 4, as a by-product of our analysis, that a correctly
tuned UCB algorithm satisfies a similar bound. However, Auer et al. (2002) found in numerical
experiments that UCB and UCB2 were outperformed by a third heuristic variant called UCB-Tuned,
which includes estimates of the variance, but no theoretical guarantee was given. In a later work,
Audibert et al. (2009) proposed a related policy, called UCB-V, which uses an empirical version of
the Bernstein bound to obtain refined upper confidence bounds. Recently, Audibert and Bubeck
(2010) introduced an improved UCB algorithm, termed MOSS, which achieves the distribution-free
optimal rate.

In this contribution, we first consider the stochastic, non-parametric, bounded bandit problem.
We consider an online index policy, called KL-UCB (for Kullback-Leibler UCB), that requires no
problem- or horizon-dependent tuning. This algorithm was recently advocated by Filippi (2010),
together with a similar procedure for Markov Decision Processes (Filippi et al., 2010), and we learnt
since our initial submission that an analysis of the Bernoulli case can also be found in Maillard et al.
(2011), together with other results. We prove in Theorem 1 below that the regret of KL-UCB
satisfies

$$\limsup_{n \to \infty} \frac{E[R_n]}{\log(n)} \leq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{d(\mu_a, \mu_{a^*})},$$

where $d(p, q) = p \log(p/q) + (1 - p) \log((1 - p)/(1 - q))$ denotes the Kullback-Leibler divergence
between Bernoulli distributions of parameters $p$ and $q$, respectively. This comes as a consequence of
Theorem 2, a non-asymptotic upper bound on the number of draws of a sub-optimal arm $a$: for all
$\epsilon > 0$ there exist $C_1$, $C_2(\epsilon)$ and $\beta(\epsilon)$ such that

$$E[N_n(a)] \leq \frac{\log(n)}{d(\mu_a, \mu_{a^*})} (1 + \epsilon) + C_1 \log(\log(n)) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}.$$

We insist on the fact that, despite the presence of the divergence $d$, this bound is not specific to the
Bernoulli case and applies to all reward distributions bounded in [0, 1] (and thus, by rescaling, to all
bounded reward distributions). By Pinsker's inequality, $d(\mu_a, \mu_{a^*}) > 2(\mu_a - \mu_{a^*})^2$, and thus KL-UCB
has strictly better theoretical guarantees than UCB, while it has the same range of application. The
improvement appears to be significant in simulations. Moreover, KL-UCB is the first index policy
that reaches the lower bound of Lai and Robbins (1985) for binary rewards; it also achieves
lower regret than UCB-V in that case. Hence, KL-UCB is both a general-purpose procedure for
bounded bandits, and an optimal solution for the binary case.

Furthermore, it is easy to adapt KL-UCB to particular (possibly non-bounded) bandit settings, 
when the distribution of reward is known to belong to some family of probability laws. By simply 
changing the definition of the divergence d, optimal algorithms may be built in a great variety of 
situations. 

The proofs we give for these results are short and elementary. They rely on deviation results for
bounded variables that are of independent interest: Lemma 9 shows that Bernoulli variables are, in
a sense, the "least concentrated" bounded variables with a given expectation (as is well-known for
variance), and Theorem 10 shows an efficient way to build confidence intervals for sums of bounded
variables with an unknown number of summands.

In practice, numerical experiments confirm the significant advantage of KL-UCB over existing 
procedures; not only does this method outperform UCB, MOSS, UCB-V and even UCB-Tuned in
various scenarios, but it also compares well to DMED in the Bernoulli case, especially for small or 
moderate horizons. 

The paper is organized as follows: in Section 2, we introduce some notation and present the
KL-UCB algorithm. Section 3 contains the main results of the paper, namely the regret bound
for KL-UCB and the optimality in the Bernoulli case. In Section 4, we show how to adapt the
KL-UCB algorithm to address general families of reward distributions, and we provide finite-time
regret bounds showing asymptotic optimality. Section 5 reports the results of extensive numerical
experiments, showing the practical superiority of KL-UCB. Section 6 is devoted to an elementary
proof of the main theorem. Finally, the Appendix gathers some deviation results that are useful in
the proofs of our regret bounds, but which are also of independent interest.

2 The KL-UCB Algorithm 

We consider the following bandit problem: the set of actions is $\{1, \dots, K\}$, where $K$ denotes a
finite integer. For each $a \in \{1, \dots, K\}$, the rewards $(X_{a,t})_{t \geq 1}$ are independent and bounded^2 in
$\Theta = [0, 1]$ with common expectation $\mu_a$. The sequences $(X_{a,\cdot})_a$ are independent. At each time step
$t = 1, 2, \dots$, the agent chooses an action $A_t$ according to her past observations (possibly using some
independent randomization) and gets the reward $X_t = X_{A_t, N_{A_t}(t)}$, where $N_a(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}$
denotes the number of times action $a$ was chosen up to time $t$. The sum of rewards she has obtained
when choosing action $a$ is denoted by $S_a(t) = \sum_{s \leq t} \mathbb{1}\{A_s = a\} X_s = \sum_{s=1}^{N_a(t)} X_{a,s}$. For $(p, q) \in \Theta^2$,
denote the Bernoulli Kullback-Leibler divergence by

$$d(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q},$$

with, by convention, $0 \log 0 = 0 \log(0/0) = 0$ and $x \log(x/0) = +\infty$ for $x > 0$.

Algorithm 1 provides the pseudo-code for KL-UCB. On line 6, $c$ is a parameter that, in the regret
bound stated below in Theorem 1, is chosen equal to 3; in practice, however, we recommend taking
$c = 0$ for optimal performance. For each arm $a$ the upper-confidence bound

$$\max\left\{ q \in \Theta : N[a] \, d\left( \frac{S[a]}{N[a]}, q \right) \leq \log(t) + c \log(\log(t)) \right\}$$

can be efficiently computed using Newton iterations, as for any $p \in [0, 1]$ the function $q \mapsto d(p, q)$
is strictly convex and increasing on the interval $[p, 1]$. In case of ties between several arms, any
maximizer can be chosen (for instance, at random). The KL-UCB algorithm elaborates on ideas
suggested in Sections 3 and 4 of Lai and Robbins (1985).
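
To make the index computation concrete, the following is a minimal sketch of the Newton iterations mentioned above, under stated assumptions: the function names `bernoulli_kl` and `kl_ucb_index_newton`, the starting point (taken from Pinsker's inequality so that the iterates approach the solution from the right) and the stopping rule are illustrative choices, not prescriptions from the analysis.

```python
import math

def bernoulli_kl(p, q):
    """Bernoulli divergence d(p, q), with 0 log 0 = 0 and x log(x/0) = +inf for x > 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def kl_ucb_index_newton(mean, pulls, t, c=0.0, iters=50):
    """Solve pulls * d(mean, q) = log(t) + c log(log(t)) for q in [mean, 1), t >= 2."""
    budget = (math.log(t) + c * math.log(math.log(t))) / pulls
    if budget <= 0.0 or mean >= 1.0:
        return mean  # the constraint set reduces to {mean} (or mean is already 1)
    # Assumed starting point: by Pinsker's inequality the solution lies below
    # mean + sqrt(budget / 2); q -> d(mean, q) is convex and increasing on [mean, 1),
    # so Newton iterates started to the right of the solution decrease towards it.
    q = min(mean + math.sqrt(budget / 2.0), 1.0 - 1e-12)
    for _ in range(iters):
        if q <= mean:
            return mean  # numerical safeguard, not expected in exact arithmetic
        value = bernoulli_kl(mean, q) - budget
        slope = (q - mean) / (q * (1.0 - q))  # derivative of q -> d(mean, q)
        step = value / slope
        q -= step
        if abs(step) < 1e-14:
            break
    return q
```

For instance, `kl_ucb_index_newton(0.5, 10, 100)` returns the value $q$ in $(0.5, 1)$ for which $10 \, d(0.5, q) = \log(100)$.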



^2 If the rewards are bounded in another interval [a, b], they should first be rescaled to [0, 1].
Algorithm 1 KL-UCB

Require: n (horizon), K (number of arms), REWARD (reward function, bounded in [0, 1])
 1: for t = 1 to K do
 2:     N[t] ← 1
 3:     S[t] ← REWARD(arm = t)
 4: end for
 5: for t = K + 1 to n do
 6:     a ← argmax_{1 ≤ a ≤ K} max{ q ∈ Θ : N[a] d(S[a]/N[a], q) ≤ log(t) + c log(log(t)) }
 7:     r ← REWARD(arm = a)
 8:     N[a] ← N[a] + 1
 9:     S[a] ← S[a] + r
10: end for
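
The pseudo-code above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the Bernoulli reward simulator, the seed handling and the clipping constant are all hypothetical, and the inner maximization uses plain bisection (valid because $q \mapsto d(p, q)$ is increasing on $[p, 1]$) rather than Newton iterations.

```python
import math
import random

def bernoulli_kl(p, q):
    eps = 1e-12  # clipping avoids log(0); a numerical shortcut, not part of the algorithm
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def index(mean, pulls, t, c):
    """Line 6: largest q >= mean with pulls * d(mean, q) <= log(t) + c log(log(t))."""
    budget = (math.log(t) + c * math.log(math.log(t))) / pulls
    lo, hi = mean, 1.0
    for _ in range(40):  # bisection down to an interval of width about 1e-12
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def kl_ucb(n, means, c=0.0, seed=0):
    """Run KL-UCB for n steps on simulated Bernoulli arms; return the draw counts N."""
    rng = random.Random(seed)
    K = len(means)

    def reward(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    N = [1] * K                          # lines 1-4: draw each arm once
    S = [reward(a) for a in range(K)]
    for t in range(K + 1, n + 1):        # lines 5-10
        a = max(range(K), key=lambda b: index(S[b] / N[b], N[b], t, c))
        r = reward(a)
        N[a] += 1
        S[a] += r
    return N
```

Calling `kl_ucb(5000, [0.9, 0.8])` simulates a single run on two Bernoulli arms with expectations 0.9 and 0.8; ties are broken here by taking the first maximizer, which is one of the admissible choices.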



3 Regret bounds and optimality 

We first state the main result of this paper. It is a direct consequence of the non-asymptotic bound 
in Theorem 2 stated below.

Theorem 1 Consider a bandit problem with K arms and independent rewards bounded in [0, 1],
and denote by $a^*$ an optimal arm. Choosing $c = 3$, the regret of the KL-UCB algorithm satisfies:

$$\limsup_{n \to \infty} \frac{E[R_n]}{\log(n)} \leq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{d(\mu_a, \mu_{a^*})}.$$



Theorem 2 Consider a bandit problem with K arms and independent rewards bounded in [0, 1].
Let $\epsilon > 0$, and take $c = 3$ in Algorithm 1. Let $a^*$ denote an arm with maximal expected reward $\mu_{a^*}$,
and let $a$ be an arm such that $\mu_a < \mu_{a^*}$. For any positive integer $n$, the number of times algorithm
KL-UCB chooses arm $a$ is upper-bounded by

$$E[N_n(a)] \leq \frac{\log(n)}{d(\mu_a, \mu_{a^*})} (1 + \epsilon) + C_1 \log(\log(n)) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}},$$

where $C_1$ denotes a positive constant and where $C_2(\epsilon)$ and $\beta(\epsilon)$ denote positive functions of $\epsilon$. Hence,

$$\limsup_{n \to \infty} \frac{E[N_n(a)]}{\log(n)} \leq \frac{1}{d(\mu_a, \mu_{a^*})}.$$

Corollary 3 If the reward distributions are Bernoulli, the KL-UCB algorithm is asymptotically
optimal, in the sense that the regret of KL-UCB matches the lower bound proved by Lai and Robbins
(1985) and generalized by Burnetas and Katehakis (1997):

$$N_n(a) \geq \left( \frac{1}{d(\mu_a, \mu_{a^*})} + o(1) \right) \log(n)$$

with a probability tending to 1.



The KL-UCB algorithm thus appears to be (asymptotically) optimal for Bernoulli rewards.
However, Lemma 9 shows that the Chernoff bounds obtained for Bernoulli variables actually apply
to any variable with range [0, 1]. This is why KL-UCB is not only efficient in the binary case, but
also for general bounded rewards.

As a by-product, we obtain an improved upper-bound for the regret of the UCB algorithm: 

Proposition 4 Consider the UCB algorithm tuned as follows: at step $t > K$, the arm that maximizes
the upper bound $S[a]/N[a] + \sqrt{(\log(t) + c \log\log(t))/(2N[a])}$ is chosen. Then, for the choice
$c = 3$, the number of draws of a sub-optimal arm $a$ is upper-bounded as:

$$E[N_n(a)] \leq \frac{\log(n)}{2(\mu_a - \mu_{a^*})^2} (1 + \epsilon) + C_1 \log(\log(n)) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}. \qquad (4)$$

This bound is "optimal", in the sense that the constant 1/2 in the logarithmic term cannot be
improved. The proof of this proposition just mimics that of Section 6 (which concerns KL-UCB),
using the quadratic divergence $d(p, q) := 2(p - q)^2$ instead of the Kullback-Leibler divergence; it
is thus omitted. In contrast, Pinsker's inequality $d(\mu_a, \mu_{a^*}) > 2(\mu_a - \mu_{a^*})^2$ shows that KL-UCB
dominates UCB, and we will see in the simulation study that the difference is significant, even for
smaller values of the horizon.
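
The effect of Pinsker's inequality on the indices themselves can be checked numerically. The sketch below, with hypothetical function names and an arbitrary grid of empirical means and draw counts, computes the KL-UCB index by bisection and the index of Proposition 4 in closed form, for the same exploration budget $\log(t) + c \log(\log(t))$.

```python
import math

def bernoulli_kl(p, q):
    eps = 1e-12  # clipping avoids log(0); a numerical shortcut
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_index(mean, pulls, budget):
    """Largest q in [mean, 1] with pulls * d(mean, q) <= budget (bisection)."""
    lo, hi = mean, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def quadratic_index(mean, pulls, budget):
    """Largest q with pulls * 2 (q - mean)^2 <= budget, i.e. the UCB index of Proposition 4."""
    return mean + math.sqrt(budget / (2.0 * pulls))

def compare(t=1000, c=0.0):
    """Return rows (mean, pulls, KL-UCB index, UCB index) on an illustrative grid."""
    budget = math.log(t) + c * math.log(math.log(t))
    rows = []
    for mean in (0.05, 0.5, 0.8, 0.95):
        for pulls in (5, 50, 500):
            rows.append((mean, pulls,
                         kl_index(mean, pulls, budget),
                         quadratic_index(mean, pulls, budget)))
    return rows
```

On every row the KL-UCB index is no larger than the UCB index, and it never exceeds 1, whereas the quadratic index can; the gap is widest for empirical means close to 0 or 1, where the Bernoulli divergence is much larger than its quadratic lower bound.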
Remark 5 At line 6, Algorithm 1 computes for each arm $a \in \{1, \dots, K\}$ the upper-confidence bound

$$\max\left\{ q \in \Theta : N[a] \, d\left( \frac{S[a]}{N[a]}, q \right) \leq \log(t) + c \log(\log(t)) \right\}.$$

The level of this confidence bound is parameterized by the exploration function $\log(t) + c \log(\log(t))$,
and the results of Theorems 1 and 2 are true as soon as $c \geq 3$. However, similar results can be
proved with an exploration function equal to $(1 + \epsilon) \log(t)$ (instead of $\log(t) + c \log(\log(t))$) for every
$\epsilon > 0$; this is no surprise, as $(1 + \epsilon) \log(t) > \log(t) + c \log(\log(t))$ when $t$ is large enough. But "large
enough", in that case, can be quite large: for $\epsilon = 0.1$, this holds true only for $t > 2 \cdot 10^{65}$. This is
why, in practice (and for the simulations presented in Section 5), we rather suggest choosing $c = 0$.
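
The order of magnitude quoted in the remark can be recovered with a short computation. Writing $x = \log(t)$, the inequality $(1 + \epsilon) \log(t) > \log(t) + c \log(\log(t))$ reads $\epsilon x > c \log(x)$; the sketch below (the function name and the bisection bracket are illustrative assumptions) locates the largest root of $\epsilon x = c \log(x)$, beyond which the inequality holds.

```python
import math

def crossover_log_t(eps, c=3.0):
    """Largest root x = log(t) of eps * x = c * log(x), assuming c / eps > e."""
    def gap(x):
        return eps * x - c * math.log(x)

    lo = c / eps            # gap is minimal (and negative) at x = c / eps
    hi = lo
    while gap(hi) <= 0.0:   # expand until the inequality holds
        hi *= 2.0
    for _ in range(200):    # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return hi

x = crossover_log_t(0.1)
print(x, x / math.log(10.0))  # log(t) and log10(t) at the crossover
```

With $\epsilon = 0.1$ and $c = 3$, the crossover is at $\log(t) \approx 150.4$, that is, $\log_{10}(t) \approx 65.3$, in line with the threshold stated in the remark.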



4 KL-UCB for parametric families of reward distributions 

The KL-UCB algorithm makes no assumption on the distribution of the rewards, except that they
are bounded. Actually, the definition of the divergence function $d$ in KL-UCB is dictated by the
rate function of the Large Deviations Principle satisfied by Bernoulli random variables: the proof
of Theorem 10 relies on the control of the Fenchel-Legendre transform of the Bernoulli distribution.
Thanks to Lemma 9, this choice also makes sense for all bounded variables.

But the method presented here is not limited to the Bernoulli case: KL-UCB can very easily be
adapted to other settings by choosing an appropriate divergence function $d$. As an illustration, we
will assume in this section that, for each arm $a$, the distribution of rewards belongs to a canonical
exponential family, i.e., that its density with respect to some reference measure can be written as
$p_{\theta_a}(x)$ for some real parameter $\theta_a$, with

$$p_\theta(x) = \exp(x\theta - b(\theta) + c(x)), \qquad (5)$$

where $\theta$ is a real parameter, $c$ is a real function and the log-partition function $b(\cdot)$ is assumed
to be twice differentiable. This family contains for instance the Exponential, Poisson, Gaussian
(with fixed variance) and Gamma (with fixed shape parameter) distributions (as well as, of course, the
Bernoulli distribution). For a random variable $X$ with density defined in (5), it is easily checked
that $\mu(\theta) = E_\theta[X] = b'(\theta)$; moreover, as $b''(\theta) = Var(X) > 0$, the function $\theta \mapsto \mu(\theta)$ is one-to-one.
Theorem 11 (in the Appendix) states that the probability of under-estimating the performance of
the best arm can be upper-bounded just as in the Bernoulli case by replacing the divergence $d(\cdot, \cdot)$
in line 6 of the KL-UCB algorithm by

$$d(x, \mu(\theta)) = \sup_{\lambda} \left\{ \lambda x - \log E_\theta[\exp(\lambda X)] \right\}.$$

For example, in the case of exponential rewards, one should choose $d(x, y) = x/y - 1 - \log(x/y)$.
Or, for Poisson rewards, the right choice is $d(x, y) = y - x + x \log(x/y)$. Then, all the results stated
above apply (as the proofs do not involve the particular form of the function $d$), and in particular:

$$\limsup_{n \to \infty} \frac{E[R_n]}{\log(n)} \leq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{d(\mu_a, \mu_{a^*})}.$$
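
As a sanity check on the two closed-form divergences quoted above, the following sketch compares them with a direct numerical evaluation of the Kullback-Leibler divergence between two Poisson laws (by summing over the probability mass function) and between two exponential laws (by a midpoint quadrature). The arguments $x$ and $y$ are the means of the two distributions; the function names, truncation point and step count are illustrative assumptions.

```python
import math

def d_exponential(x, y):
    """Divergence for exponential rewards with means x and y."""
    return x / y - 1.0 - math.log(x / y)

def d_poisson(x, y):
    """Divergence for Poisson rewards with means x and y."""
    return y - x + x * math.log(x / y)

def kl_poisson_by_summation(x, y, terms=200):
    """KL(Poisson(x), Poisson(y)) by truncated summation over k = 0, ..., terms - 1."""
    total = 0.0
    for k in range(terms):
        log_px = -x + k * math.log(x) - math.lgamma(k + 1)
        log_py = -y + k * math.log(y) - math.lgamma(k + 1)
        total += math.exp(log_px) * (log_px - log_py)
    return total

def kl_exponential_by_quadrature(x, y, upper=100.0, steps=100_000):
    """KL between exponential laws of means x and y, midpoint rule on [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        log_fx = -math.log(x) - u / x
        log_fy = -math.log(y) - u / y
        total += math.exp(log_fx) * (log_fx - log_fy) * h
    return total
```

For moderate means (for instance 3 and 5 in the Poisson case, 2 and 3 in the exponential case), both numerical evaluations agree with the closed forms up to discretization and truncation error.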

In order to prove that this version of the KL-UCB algorithm matches the bound of Lai and Robbins
(1985) for the families of rewards considered here, it remains only to show that
$d(x, y) = KL(p_{\mu^{-1}(x)}, p_{\mu^{-1}(y)})$. This is the object of Lemma 6. Generalizations to other
families of reward distributions (possibly different from arm to arm) are conceivable, but require
more technical, topological discussions, as in Burnetas and Katehakis (1997) and Honda and
Takemura (2010).

To conclude, observe that it is not required to work with the divergence function $d$ corresponding
exactly to the family of reward distributions: using an upper bound instead often leads to
simpler and more versatile policies, at the price of only a slight loss in performance. This is
illustrated in the third scenario of the simulation study presented in Section 5, but also by
Theorems 1 and 2 for bounded variables.



Lemma 6 Let $(\beta, \theta)$ be two real numbers, let $p_\beta$ and $p_\theta$ be two probability densities of the canonical
exponential family defined in (5), and let $X$ have density $p_\theta$. Then the Kullback-Leibler divergence
$KL(p_\beta, p_\theta)$ is equal to $d(\mu(\beta), \mu(\theta))$. More precisely,

$$KL(p_\beta, p_\theta) = d(\mu(\beta), \mu(\theta)) = \mu(\beta)(\beta - \theta) - b(\beta) + b(\theta).$$
Proof: First, it holds that

$$KL(p_\beta, p_\theta) = \int \exp(x\beta - b(\beta) + c(x)) \{ x(\beta - \theta) - b(\beta) + b(\theta) \} \, dx = \mu(\beta)(\beta - \theta) - b(\beta) + b(\theta).$$

Then, observe that $E[\exp(\lambda X)] = \int \exp(x(\theta + \lambda) - b(\theta) + c(x)) \, dx = \exp(b(\theta + \lambda) - b(\theta))$. Thus,
for every $x$ the maximum of the (smooth, concave) function

$$\lambda \mapsto \lambda x - \log E[\exp(\lambda X)] = \lambda x - b(\theta + \lambda) + b(\theta)$$

is reached for $\lambda = \lambda^*$ such that $x = b'(\theta + \lambda^*) = \mu(\theta + \lambda^*)$. Thus, if $x = \mu(\beta)$, the fact that $\mu$ is
one-to-one implies that $\theta + \lambda^* = \beta$ and thus that:

$$d(\mu(\beta), \mu(\theta)) = (\beta - \theta) \mu(\beta) - b(\beta) + b(\theta). \qquad ■$$
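
As an illustration of Lemma 6 (a worked instance added for concreteness, using only the notation of (5)), consider the Poisson family written in canonical form with respect to the counting measure on the non-negative integers. The identity of the lemma then gives back the Poisson divergence quoted earlier in this section:

```latex
% Poisson family in canonical form: p_\theta(k) = \exp(k\theta - e^{\theta} - \log k!),
% so that b(\theta) = e^{\theta} and c(k) = -\log k!.
\begin{align*}
\mu(\theta) &= b'(\theta) = e^{\theta}, \\
KL(p_\beta, p_\theta) &= \mu(\beta)(\beta - \theta) - b(\beta) + b(\theta)
   = e^{\beta}(\beta - \theta) - e^{\beta} + e^{\theta}, \\
\text{and, with } x = e^{\beta},\ y = e^{\theta}: \qquad
d(x, y) &= x \log\frac{x}{y} - x + y .
\end{align*}
```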



5 Numerical experiments and comparisons of the policies 

Simulation studies require particular attention in the case of bandit algorithms. As pointed out by
Audibert et al. (2009), for a fixed horizon $n$ the distribution of the regret is very poorly concentrated
around its expectation. This can be explained as follows: most of the time, the estimates of all arms
remain correctly ordered for almost all instants $t = 1, \dots, n$ and the regret is of order $\log(n)$. But
sometimes, at the beginning, the best arm is under-estimated while one of the sub-optimal arms
is over-estimated, so that the agent keeps playing the latter; and as she neglects the best arm, she
hardly has an occasion to realize her mistake, and the error perpetuates for a very long time. This
happens with a small probability, which nevertheless cannot be neglected because the regret is very
large (of order $n$) on these occasions. Bandit algorithms are typically designed to control the
probability of such adverse events, but usually at a rate which only decays slightly faster than $1/n$,
which results in very skewed regret distributions with slowly vanishing upper tails.




Figure 1: Performance of the various algorithms in the simple two-arm scenario. Left: mean number
of draws of the suboptimal arm as a function of time $n$ (log scale); right: box-plots showing the
distributions of the number of draws of the suboptimal arm at time $n = 5,000$. Results based on
50,000 independent runs.



5.1 Scenario 1: two arms 

We first consider the basic two-arm scenario with Bernoulli rewards of expectations $\mu_1 = 0.9$ and
$\mu_2 = 0.8$, respectively. The left panel of Figure 1 shows the average number of draws of the suboptimal arm
as a function of time (on a logarithmic scale) for KL-UCB compared to five benchmark algorithms
(UCB, MOSS, UCB-Tuned, UCB-V and DMED). The right panel of Figure 1 shows the empirical
distributions of suboptimal draws, represented as box-and-whiskers plots, at a particular time
($t = 5,000$) for all six algorithms. These plots are obtained from $N = 50,000$ independent runs of the
algorithms, and the right panel of Figure 1 clearly highlights the tail effect mentioned above. On
this very simple example, we observed that results obtained from $N = 1,000$ or fewer simulations
were not reliable, typically resulting in a significant over-estimation^3 of the performance of "risky"

^3 Incidentally, Theorem 10 could be used to construct sharp confidence bounds for the regret.
algorithms, in particular of UCB-Tuned. More generally, results obtained in configurations where
$N$ is much smaller than $n$ are likely to be unreliable. For this reason, we limit our investigations to
a final instant of $n = 20,000$. Note however that the average number of suboptimal draws of most
algorithms at $n = 20,000$ is only of the order of 300, showing that there is no point in considering
larger horizons for such a simple problem.

MOSS, UCB-Tuned and UCB-V are run exactly as described by Audibert and Bubeck (2010),
Auer et al. (2002) and Audibert et al. (2009), respectively. For UCB, we use an upper confidence
bound $S[a]/N[a] + \sqrt{\log(t)/(2N[a])}$, as in Proposition 4, again with $c = 0$. Note that in our two-arm
scenario, $\{2(\mu_1 - \mu_2)^2\}^{-1} = 50$ while $d^{-1}(\mu_2, \mu_1) = 22.5$. Hence, the performance of DMED and
KL-UCB should be about two times better than that of UCB. The left panel of Figure 1 does show
the expected behavior but with a difference of lesser magnitude. Indeed, one can observe that the
bound $d^{-1}(\mu_2, \mu_1) \log(n)$ (shown in dashed line) is quite pessimistic for the values of the horizon $n$
considered here, as the actual performance of KL-UCB is significantly below the bound. For DMED,
we follow Honda and Takemura (2010) but using

$$N[a] \, d\left( \frac{S[a]}{N[a]}, \max_b \frac{S[b]}{N[b]} \right) < \log(t) \qquad (6)$$

as the criterion to decide whether arm $a$ should be included in the list of arms to be played. This
criterion is clearly related to the decision rule used by KL-UCB when $c = 0$ (see line 6 of Algorithm 1)
except for the fact that in KL-UCB the estimate $S[a]/N[a]$ is not compared to that of the current
best arm $\max_b S[b]/N[b]$ but to the corresponding upper confidence bound. As a consequence, any
arm that is not included in the list of arms to be played by DMED would not be played by KL-UCB
either (assuming that both algorithms share a common history). As one can observe on the left
panel of Figure 1, this results in a degraded performance for DMED. We also observed this effect
on UCB, for instance, and it seems that index algorithms are generally preferable to their "arm
elimination" variant.
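
The two leading constants quoted for this scenario follow directly from the definitions. The short computation below (variable and function names are illustrative) evaluates $\{2(\mu_1 - \mu_2)^2\}^{-1}$ and $d^{-1}(\mu_2, \mu_1)$, understood as $1/d(\mu_2, \mu_1)$, for $\mu_1 = 0.9$ and $\mu_2 = 0.8$.

```python
import math

def bernoulli_kl(p, q):
    """Bernoulli divergence d(p, q) for p and q strictly between 0 and 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

mu1, mu2 = 0.9, 0.8
ucb_constant = 1.0 / (2.0 * (mu1 - mu2) ** 2)   # leading constant of the UCB bound
kl_constant = 1.0 / bernoulli_kl(mu2, mu1)      # leading constant of the KL-UCB bound
print(round(ucb_constant, 1), round(kl_constant, 1))  # prints: 50.0 22.5
```

The ratio of the two constants is about 2.2, which is the factor of "about two" mentioned in the text.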

The original proposal of Honda and Takemura (2010) consists in using in the exploration function
a factor $\log(t/N[a])$ instead of $\log(t)$, as in the MOSS algorithm. As will be seen below on Figure 2,
this variant (which we refer to as DMED+) indeed outperforms DMED. But our previous conjecture
appears to hold also in this case, as the heuristic variant of KL-UCB (termed KL-UCB+) in which
$\log(t)$ in line 6 of Algorithm 1 is replaced by $\log(t/N[a])$ remains preferable to DMED+.

As final comments on Figure 1, first note that UCB-Tuned performs as expected (though slightly
worse than KL-UCB) but is a very risky algorithm: the right panel of Figure 1 casts some doubts
on the fact that the tails of $N_a(n)$ are indeed controlled uniformly in $n$ for UCB-Tuned. Second, the
performance of UCB-V is somewhat disappointing. Indeed, the upper-confidence bounds of UCB-V
differ from those of UCB-Tuned simply by the non-asymptotic correction term $3 \log(t)/N[a]$ required
by Bennett's and Bernstein's inequalities (Audibert et al., 2009). This correction term appears to
have a significant impact on moderate time horizons: for a sub-optimal arm $a$, the number of draws
$N[a]$ does not grow faster than the $\log(t)$ exploration function, and $\log(t)/N[a]$ does not vanish.

5.2 Scenario 2: low rewards 

In Figure 2, we consider a significantly more difficult scenario, again with Bernoulli rewards, inspired
by a situation (frequent in applications like marketing or Internet advertising) where the mean reward
of each arm is very low. In this scenario, there are ten arms: the optimal arm has expected reward
0.1, and the nine suboptimal arms consist of three different groups of three (stochastically) identical
arms each, with expected rewards 0.05, 0.02 and 0.01, respectively. We again used $N = 50,000$
simulations to obtain the regret plots of Figure 2. These plots show, for each algorithm, the average
cumulated regret together with quantiles of the cumulated regret distribution as a function of time
(again on a logarithmic scale).

In this scenario, the difference is more pronounced between UCB and KL-UCB. The performance
gain of UCB-Tuned is also much less significant. KL-UCB and DMED reach a performance that
is on par with the lower bound of Burnetas and Katehakis (1997) in (2), although the performance
of KL-UCB is here again significantly better. Using KL-UCB+ and DMED+ results in significant
mean improvements, although there are hints that those algorithms might indeed be too risky, with
occasional very large deviations from the mean regret curve.

The final algorithm included in this roundup, called CP-UCB, is in some sense a further adaptation
of KL-UCB to the specific case of Bernoulli rewards. For $n \in \mathbb{N}$ and $p \in [0, 1]$, denote by
$P_{n,p}$ the binomial distribution with parameters $n$ and $p$. For a random variable $X$ with distribution
$P_{n,p}$, the Clopper-Pearson upper-confidence bound of
Figure 2: Regret of the various algorithms as a function of time $n$ (on a log scale) in the ten-arm
scenario. On each graph, the red dashed line shows the lower bound, the solid bold curve corresponds
to the mean regret, while the dark and light shaded regions show the central 99% region
and the upper 0.05% quantile, respectively.



risk $\alpha \in ]0, 1[$ for $p$ is

$$u^{CP}(X, n, \alpha) = \max\{ q \in [0, 1] : P_{n,q}([0, X]) \geq \alpha \}.$$

It is easily verified that $P_{n,p}(p \leq u^{CP}(X)) \geq 1 - \alpha$, and that $u^{CP}(X)$ is the smallest quantity
satisfying this property: $u^{CP}(X) \leq \tilde{u}(X)$ for any other upper-confidence bound $\tilde{u}(X)$ of risk at
most $\alpha$.
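
The Clopper-Pearson bound defined above can be computed by bisection, since for fixed $X < n$ the binomial distribution function $q \mapsto P_{n,q}([0, X])$ is continuous and decreasing in $q$. The sketch below is illustrative only: the function names and the tolerance are assumptions, and the distribution function is evaluated by direct summation, which is adequate for moderate $n$.

```python
import math

def binomial_cdf(x, n, q):
    """P_{n,q}([0, x]): probability that a Binomial(n, q) variable is at most x."""
    return sum(math.comb(n, k) * q**k * (1.0 - q) ** (n - k) for k in range(x + 1))

def clopper_pearson_upper(x, n, alpha, tol=1e-10):
    """max{q in [0, 1] : P_{n,q}([0, x]) >= alpha}, computed by bisection."""
    if x >= n:
        return 1.0  # the distribution function equals 1 for every q
    lo, hi = 0.0, 1.0  # cdf is 1 at q = 0 and 0 at q = 1 when x < n
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binomial_cdf(x, n, mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo
```

For $X = 0$ the bound has the closed form $1 - \alpha^{1/n}$, which gives a convenient check: `clopper_pearson_upper(0, 10, 0.05)` should agree with `1 - 0.05 ** 0.1` up to the tolerance.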

The Clopper-Pearson Upper-Confidence Bound algorithm (CP-UCB) differs from KL-UCB only
in the way the upper-confidence bound on the performance of each arm is computed, replacing line
6 of Algorithm 1 by

$$a \leftarrow \arg\max_{1 \leq a \leq K} u^{CP}\left( S[a], N[a], \frac{1}{t \log(t)^c} \right).$$

As the Clopper-Pearson confidence intervals are always sharper than the Kullback-Leibler intervals,
one can very easily adapt the proof of Section 6 to show that the regret bounds proved for the
KL-UCB algorithm also hold for CP-UCB in the case of Bernoulli rewards. However, the improvement
over KL-UCB is very limited (often, the two algorithms actually take exactly the same decisions).
In terms of results, one can observe on Figure 2 that CP-UCB only achieves a performance that is
marginally better than that of KL-UCB. Besides, there is no guarantee that the CP-UCB algorithm
is also efficient on arbitrary bounded distributions.

5.3 Scenario 3: bounded exponential rewards 

In the third example, there are 5 arms: the rewards are exponential variables, with parameters
1/5, 1/4, 1/3, 1/2 and 1 respectively, truncated at $x_{max} = 10$ (thus, they are bounded in [0, 10]).
The interest of this scenario is twofold: first, it shows the performance of KL-UCB for non-binary,
non-discrete, non-[0, 1]-valued rewards. Second, it illustrates that, as explained in Section 4, specific
variants of the KL-UCB algorithm can reach an even better performance.

In this scenario, UCB and MOSS, but also KL-UCB, are clearly sub-optimal. UCB-Tuned and
UCB-V, by taking into account the variance of the reward distributions (much smaller than the
variance of a {0, 10}-valued distribution with the same expectation), were expected to perform
significantly better. For the reasons mentioned above, this is not the case for UCB-V on a time
horizon $n = 20,000$. Yet, UCB-Tuned is spectacularly more efficient, and is only caught up by
KL-UCB-exp, the variant of KL-UCB designed for exponential rewards. Actually, the KL-UCB-exp
algorithm ignores the fact that the rewards are truncated, and uses the divergence
$d(x, y) = x/y - 1 - \log(x/y)$ prescribed for genuine exponential distributions. One can easily show that this
choice leads to slightly too large upper confidence bounds. Yet, the performance is still excellent
and stable, and the algorithm is particularly simple.

6 Proof of Theorem 2

Consider a positive integer $n$, a small $\epsilon > 0$, an optimal arm $a^*$ and a sub-optimal arm $a$ such that
$\mu_a < \mu_{a^*}$. Without loss of generality, we will assume that $a^* = 1$. For any arm $b$, the past average
performance of arm $b$ is denoted by $\hat\mu_b(t) = S_b(t)/N_b(t)$; for convenience, for every positive integer
$s$ we will also denote $\hat\mu_{b,s} = (X_{b,1} + \dots + X_{b,s})/s$, so that $\hat\mu_b(t) = \hat\mu_{b,N_b(t)}$. KL-UCB relies on the
following upper-confidence bound for $\mu_b$:

$$u_b(t) = \max\{ q > \hat\mu_b(t) : N_b(t) \, d(\hat\mu_b(t), q) \leq \log(t) + 3 \log(\log(t)) \}.$$

For $x, y \in [0, 1]$, define $d^+(x, y) = d(x, y) \mathbb{1}_{x < y}$. The expectation of $N_n(a)$ is upper-bounded by using
the following decomposition:



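$$\begin{aligned}
E[N_n(a)] &= E\left[ \sum_{t=1}^{n} \mathbb{1}\{A_t = a\} \right] \\
&\leq E\left[ \sum_{t=1}^{n} \mathbb{1}\{\mu_1 > u_1(t)\} \right] + E\left[ \sum_{t=1}^{n} \mathbb{1}\{A_t = a, \mu_1 \leq u_1(t)\} \right] \\
&\leq \sum_{t=1}^{n} P(\mu_1 > u_1(t)) + E\left[ \sum_{s=1}^{n} \mathbb{1}\{ s \, d^+(\hat\mu_{a,s}, \mu_1) \leq \log(n) + 3 \log(\log(n)) \} \right],
\end{aligned}$$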



where the last inequality is a consequence of Lemma 7. The first summand is upper-bounded as
follows: by Theorem 10 (proved in the Appendix),

$$\begin{aligned}
P(\mu_1 > u_1(t)) &\leq e \lceil \log(t) (\log(t) + 3 \log(\log(t))) \rceil \exp(-\log(t) - 3 \log(\log(t))) \\
&= \frac{e \lceil \log(t)^2 + 3 \log(t) \log(\log(t)) \rceil}{t \log(t)^3}.
\end{aligned}$$
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Hence, 



V- ^ ,,,, ^ ^ e[log(0^+31og(01og(logft))] 



t=i t=i 



for some positive constant C'l (C( < 7 is sufficient). For the second summand, let 

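\[ K_n = \left\lfloor \frac{1+\epsilon}{d^+(\mu_a, \mu_1)} \big( \log(n) + 3\log(\log(n)) \big) \right\rfloor . \]

Then: 

\[
\begin{aligned}
\sum_{s=1}^{n} \mathbb{P}\big( s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n)) \big)
&\le K_n + \sum_{s=K_n+1}^{n} \mathbb{P}\big( s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n)) \big) \\
&\le K_n + \sum_{s=K_n+1}^{\infty} \mathbb{P}\big( (K_n+1)\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n)) \big) \\
&\le K_n + \sum_{s=K_n+1}^{\infty} \mathbb{P}\Big( d^+(\hat\mu_{a,s}, \mu_1) < \frac{d(\mu_a, \mu_1)}{1+\epsilon} \Big) \\
&\le \frac{1+\epsilon}{d^+(\mu_a, \mu_1)} \big( \log(n) + 3\log(\log(n)) \big) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}
\end{aligned}
\]

according to Lemma 8. The third inequality uses the fact that $K_n + 1 > (1+\epsilon)\big(\log(n) + 3\log(\log(n))\big)/d^+(\mu_a, \mu_1)$. This will conclude the proof, provided that we prove the following two 
lemmas. 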

Lemma 7 

\[ \sum_{t=1}^{n} \mathbb{1}\{A_t = a,\ \mu_1 \le u_1(t)\} \le \sum_{s=1}^{n} \mathbb{1}\big\{ s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n)) \big\} . \]

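Proof: Observe that $A_t = a$ and $\mu_1 \le u_1(t)$ together imply that $u_a(t) \ge u_1(t) \ge \mu_1$, and hence that 

\[ d^+(\hat\mu_a(t), \mu_1) \le d(\hat\mu_a(t), u_a(t)) = \frac{\log(t) + 3\log(\log(t))}{N_a(t)} . \]

Thus, 

\[
\begin{aligned}
\sum_{t=1}^{n} \mathbb{1}\{A_t = a,\ \mu_1 \le u_1(t)\}
&\le \sum_{t=1}^{n} \mathbb{1}\big\{A_t = a,\ N_a(t)\, d^+(\hat\mu_a(t), \mu_1) \le \log(t) + 3\log(\log(t))\big\} \\
&= \sum_{t=1}^{n} \sum_{s=1}^{t} \mathbb{1}\big\{N_a(t) = s,\ A_t = a,\ s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(t) + 3\log(\log(t))\big\} \\
&\le \sum_{t=1}^{n} \sum_{s=1}^{t} \mathbb{1}\{N_a(t) = s,\ A_t = a\}\, \mathbb{1}\big\{s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n))\big\} \\
&= \sum_{s=1}^{n} \mathbb{1}\big\{s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n))\big\} \sum_{t=s}^{n} \mathbb{1}\{N_a(t) = s,\ A_t = a\} \\
&\le \sum_{s=1}^{n} \mathbb{1}\big\{s\, d^+(\hat\mu_{a,s}, \mu_1) \le \log(n) + 3\log(\log(n))\big\} ,
\end{aligned}
\]

as, for every $s \in \{1, \dots, n\}$, $\sum_{t=s}^{n} \mathbb{1}\{N_a(t) = s,\ A_t = a\} \le 1$. ■ 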

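Lemma 8 For each $\epsilon > 0$, there exist $C_2(\epsilon) > 0$ and $\beta(\epsilon) > 0$ such that 

\[ \sum_{s=K_n+1}^{\infty} \mathbb{P}\Big( d^+(\hat\mu_{a,s}, \mu_1) < \frac{d(\mu_a, \mu_1)}{1+\epsilon} \Big) \le \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}} . \]

Proof: If $d^+(\hat\mu_{a,s}, \mu_1) < d(\mu_a, \mu_1)/(1+\epsilon)$, then $\hat\mu_{a,s} > r(\epsilon)$, where $r(\epsilon) \in\, ]\mu_a, \mu_1[$ is such that 
$d(r(\epsilon), \mu_1) = d(\mu_a, \mu_1)/(1+\epsilon)$. Hence, 

\[
\begin{aligned}
\mathbb{P}\Big( d^+(\hat\mu_{a,s}, \mu_1) < \frac{d(\mu_a, \mu_1)}{1+\epsilon} \Big)
&\le \mathbb{P}\big( d(\hat\mu_{a,s}, \mu_a) > d(r(\epsilon), \mu_a),\ \hat\mu_{a,s} > \mu_a \big) \\
&\le \mathbb{P}\big( \hat\mu_{a,s} > r(\epsilon) \big) \le \exp\big( -s\, d(r(\epsilon), \mu_a) \big) ,
\end{aligned}
\]

and 

\[ \sum_{s=K_n+1}^{\infty} \mathbb{P}\Big( d^+(\hat\mu_{a,s}, \mu_1) < \frac{d(\mu_a, \mu_1)}{1+\epsilon} \Big) \le \frac{\exp\big( -d(r(\epsilon), \mu_a)\,(K_n+1) \big)}{1 - \exp\big( -d(r(\epsilon), \mu_a) \big)} \le \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}} , \]

with $C_2(\epsilon) = \big(1 - \exp(-d(r(\epsilon), \mu_a))\big)^{-1}$ and $\beta(\epsilon) = (1+\epsilon)\, d(r(\epsilon), \mu_a)/d(\mu_a, \mu_1)$. Easy computations 
show that $r(\epsilon) = \mu_a + O(\epsilon)$, so that $C_2(\epsilon) = O(\epsilon^{-2})$ and $\beta(\epsilon) = O(\epsilon^{2})$. ■ 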
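
To give a feel for the magnitude of these constants, here is a minimal numerical sketch for Bernoulli rewards. It assumes that $d$ is the Bernoulli Kullback-Leibler divergence and that both means lie strictly inside $]0, 1[$; the function names `kl` and `lemma8_constants` are hypothetical, and $r(\epsilon)$ is located by bisection rather than by any method stated in the paper.

```python
import math


def kl(p: float, q: float) -> float:
    """Bernoulli KL divergence d(p, q) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lemma8_constants(mu_a: float, mu_1: float, eps: float, iters: int = 100):
    """Return (r, C2, beta) with d(r, mu_1) = d(mu_a, mu_1) / (1 + eps)."""
    target = kl(mu_a, mu_1) / (1.0 + eps)
    lo, hi = mu_a, mu_1  # q -> d(q, mu_1) decreases to 0 on [mu_a, mu_1]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(mid, mu_1) > target:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    c2 = 1.0 / (1.0 - math.exp(-kl(r, mu_a)))
    beta = (1.0 + eps) * kl(r, mu_a) / kl(mu_a, mu_1)
    return r, c2, beta


r, c2, beta = lemma8_constants(0.4, 0.5, 0.1)
r_small, c2_small, beta_small = lemma8_constants(0.4, 0.5, 0.01)
print(r, c2, beta)
print(r_small, c2_small, beta_small)
```

Shrinking $\epsilon$ moves $r(\epsilon)$ towards $\mu_a$, inflates $C_2(\epsilon)$ and reduces $\beta(\epsilon)$, in line with the orders of magnitude stated in the lemma.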



7 Conclusion 

The self-normalized deviation bounds of Theorems 10 and 11, together with the new analysis 
presented in Section 6, allowed us to design and analyze improved UCB algorithms. In this approach, 
only an upper-bound of the deviations (more precisely, of the exponential moments) of the rewards is 
required, which makes it possible to obtain versatile policies satisfying interesting regret bounds for 
large classes of reward distributions. The resulting index policies are simple, fast, and very efficient 
in practice, even for small time horizons. 

Appendix: Deviation Results 

We start with a simple lemma justifying the focus on binary rewards. 

Lemma 9 Let $X$ be a random variable taking value in $[0, 1]$, and let $\mu = \mathbb{E}[X]$. Then, for all $\lambda \in \mathbb{R}$, 

\[ \mathbb{E}[\exp(\lambda X)] \le 1 - \mu + \mu \exp(\lambda) . \]

Proof: The function $f : [0, 1] \to \mathbb{R}$ defined by $f(x) = \exp(\lambda x) - x(\exp(\lambda) - 1) - 1$ is convex and 
such that $f(0) = f(1) = 0$, hence $f(x) \le 0$ for all $x \in [0, 1]$. Consequently, 

\[ \mathbb{E}[\exp(\lambda X)] \le \mathbb{E}[X(\exp(\lambda) - 1) + 1] = \mu(\exp(\lambda) - 1) + 1 . \]
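
The inequality of Lemma 9 can be checked numerically on a small example. The sketch below uses an arbitrary discrete distribution on $[0, 1]$ chosen purely for illustration (the support points and probabilities are assumptions, as are the helper names `mgf` and `bernoulli_mgf`).

```python
import math


def mgf(values, probs, lam):
    """E[exp(lam * X)] for a discrete random variable X."""
    return sum(p * math.exp(lam * v) for v, p in zip(values, probs))


def bernoulli_mgf(mu, lam):
    """Moment-generating function of a Bernoulli variable with mean mu."""
    return 1.0 - mu + mu * math.exp(lam)


# Illustrative distribution supported on [0, 1].
values = [0.0, 0.25, 0.5, 1.0]
probs = [0.1, 0.4, 0.3, 0.2]
mu = sum(p * v for v, p in zip(values, probs))

lams = (-5.0, -1.0, 0.0, 0.5, 3.0)
gaps = [bernoulli_mgf(mu, lam) - mgf(values, probs, lam) for lam in lams]
print(mu, gaps)
```

Every gap is non-negative (up to rounding at $\lambda = 0$, where both sides equal 1), which is why the analysis may focus on binary rewards.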



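Theorem 10 Let $(X_t)_{t \ge 1}$ be a sequence of independent random variables bounded in $[0, 1]$ defined 
on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with common expectation $\mu = \mathbb{E}[X_t]$. Let $\mathcal{F}_t$ be an increasing sequence 
of $\sigma$-fields of $\mathcal{F}$ such that for each $t$, $\sigma(X_1, \dots, X_t) \subset \mathcal{F}_t$ and for $s > t$, $X_s$ is independent from $\mathcal{F}_t$. 
Consider a previsible sequence $(\varepsilon_t)_{t \ge 1}$ of Bernoulli variables (for all $t > 0$, $\varepsilon_t$ is $\mathcal{F}_{t-1}$-measurable). 
Let $\delta > 0$ and for every $t \in \{1, \dots, n\}$ let 

\[ S(t) = \sum_{s=1}^{t} \varepsilon_s X_s , \qquad N(t) = \sum_{s=1}^{t} \varepsilon_s , \qquad \hat\mu(t) = \frac{S(t)}{N(t)} , \]

\[ u(n) = \operatorname{argmax}\big\{ q > \hat\mu(n) : N(n)\, d(\hat\mu(n), q) \le \delta \big\} . \]

Then 

\[ \mathbb{P}\big(u(n) < \mu\big) \le e \lceil \delta \log(n) \rceil \exp(-\delta) . \]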
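
A quick Monte Carlo experiment illustrates how the bound of Theorem 10 compares with observed frequencies. This is only a sketch under simplifying assumptions: the rewards are i.i.d. Bernoulli, every round is observed ($\varepsilon_t = 1$ for all $t$, so $N(n) = n$), and the event $u(n) < \mu$ is detected through the equivalent condition $\hat\mu(n) < \mu$ and $N(n)\, d(\hat\mu(n), \mu) > \delta$. All function names and parameter values are illustrative.

```python
import math
import random


def kl(p: float, q: float, tiny: float = 1e-15) -> float:
    """Bernoulli KL divergence with arguments clipped away from 0 and 1."""
    p = min(max(p, tiny), 1.0 - tiny)
    q = min(max(q, tiny), 1.0 - tiny)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def deviation_bound(delta: float, n: int) -> float:
    """Right-hand side e * ceil(delta * log(n)) * exp(-delta)."""
    return math.e * math.ceil(delta * math.log(n)) * math.exp(-delta)


def underestimation_frequency(mu, n, delta, runs, seed=0):
    """Empirical frequency of {mean < mu and n * d(mean, mu) > delta}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        successes = sum(1 for _ in range(n) if rng.random() < mu)
        mean = successes / n
        if mean < mu and n * kl(mean, mu) > delta:
            hits += 1
    return hits / runs


freq = underestimation_frequency(mu=0.5, n=100, delta=6.0, runs=2000)
bound = deviation_bound(6.0, 100)
print(freq, bound)
```

In this i.i.d. setting the bound is far from tight; its value lies in holding uniformly over previsible sampling schemes, where $N(n)$ is random.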

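Proof: For every $\lambda \in \mathbb{R}$, let $\phi_\mu(\lambda) = \log(1 - \mu + \mu \exp(\lambda))$. By Lemma 9, $\log \mathbb{E}[\exp(\lambda X_1)] \le \phi_\mu(\lambda)$. 
Let $W_0^\lambda = 1$ and for $t \ge 1$, 

\[ W_t^\lambda = \exp\big( \lambda S(t) - N(t) \phi_\mu(\lambda) \big) . \]

$(W_t^\lambda)_{t \ge 0}$ is a super-martingale relative to $(\mathcal{F}_t)_{t \ge 0}$. In fact, 

\[
\begin{aligned}
\mathbb{E}\big[ \exp\big( \lambda \{S(t+1) - S(t)\} \big) \,\big|\, \mathcal{F}_t \big]
&= \mathbb{E}\big[ \exp( \lambda \varepsilon_{t+1} X_{t+1} ) \,\big|\, \mathcal{F}_t \big]
= \exp\big( \varepsilon_{t+1} \log \mathbb{E}[\exp(\lambda X_1)] \big) \\
&\le \exp\big( \varepsilon_{t+1} \phi_\mu(\lambda) \big) = \exp\big( \{N(t+1) - N(t)\} \phi_\mu(\lambda) \big) ,
\end{aligned}
\]

which can be rewritten as 

\[ \mathbb{E}\big[ \exp\big( \lambda S(t+1) - N(t+1) \phi_\mu(\lambda) \big) \,\big|\, \mathcal{F}_t \big] \le \exp\big( \lambda S(t) - N(t) \phi_\mu(\lambda) \big) . \]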

To proceed, we make use of the so-called 'peeling trick' (see for instance Massart): we 
divide the interval $\{1, \dots, n\}$ of possible values for $N(n)$ into "slices" $\{t_{k-1}+1, \dots, t_k\}$ of geometrically 
increasing size, and treat the slices independently. We may assume that $\delta > 1$, since otherwise 
the bound is trivial. Take $\eta = 1/(\delta - 1)$, let $t_0 = 0$ and for $k \in \mathbb{N}^*$, let $t_k = \lfloor (1+\eta)^k \rfloor$. Let $D$ be the 
first integer such that $t_D \ge n$, that is $D = \lceil \log(n)/\log(1+\eta) \rceil$. Let $A_k = \{t_{k-1} < N(n) \le t_k\} \cap \{u(n) < \mu\}$. 
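
The peeling grid is easy to compute explicitly, which helps to see how few slices are needed. The sketch below follows the definitions of $\eta$, $t_k$ and $D$ given above; the function name `peeling_grid` and the example values $n = 1000$, $\delta = 3$ are assumptions made for illustration.

```python
import math


def peeling_grid(n: int, delta: float):
    """Return (eta, D, t) where t[k] = floor((1 + eta)^k), t[0] = 0,
    eta = 1 / (delta - 1) and D = ceil(log(n) / log(1 + eta)); needs delta > 1."""
    eta = 1.0 / (delta - 1.0)
    D = math.ceil(math.log(n) / math.log(1.0 + eta))
    t = [0] + [math.floor((1.0 + eta) ** k) for k in range(1, D + 1)]
    return eta, D, t


eta, D, t = peeling_grid(1000, 3.0)
# Slice k covers the values t[k-1] + 1, ..., t[k] of N(n).
slices = [(t[k - 1] + 1, t[k]) for k in range(1, D + 1)]
print(eta, D)
print(slices)
```

For these values only 18 slices cover all 1000 possible values of $N(n)$, so the union bound over slices costs a logarithmic factor rather than a linear one.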

We have: 

\[ \mathbb{P}\big(u(n) < \mu\big) \le \mathbb{P}\Big( \bigcup_{k=1}^{D} A_k \Big) \le \sum_{k=1}^{D} \mathbb{P}(A_k) . \qquad (7) \]

Observe that $u(n) < \mu$ if and only if $\hat\mu(n) < \mu$ and $N(n)\, d(\hat\mu(n), \mu) > \delta$. Let $s$ be the smallest integer 
such that $\delta/(s+1) \le d(0; \mu)$; if $N(n) \le s$, then $N(n)\, d(\hat\mu(n), \mu) \le s\, d(\hat\mu(n), \mu) \le s\, d(0, \mu) < \delta$ and 
$\mathbb{P}(u(n) < \mu) = 0$. Thus, $\mathbb{P}(A_k) = 0$ for all $k$ such that $t_k \le s$. 

For $k$ such that $t_k > s$, let $\tilde t_{k-1} = \max\{t_{k-1}, s\}$. Let $x \in\, ]0, \mu[$ be such that $d(x; \mu) = \delta/N(n)$ 
and let $\lambda(x) = \log(x(1-\mu)) - \log(\mu(1-x)) < 0$, so that $d(x; \mu) = \lambda(x)\, x - \log\big(1 - \mu + \mu \exp(\lambda(x))\big)$. 
Consider $z$ such that $z < \mu$ and $d(z, \mu) = \delta/(1+\eta)^k$. Observe that: 



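• if $N(n) > \tilde t_{k-1}$, then 

\[ d(z; \mu) = \frac{\delta}{(1+\eta)^k} \ge \frac{\delta}{(1+\eta) N(n)} ; \]

• if $N(n) \le t_k$, then as 

\[ d(\hat\mu(n); \mu) > \frac{\delta}{N(n)} \ge \frac{\delta}{(1+\eta)^k} , \]

it holds that 

\[ \hat\mu(n) < \mu \ \text{ and } \ d(\hat\mu(n); \mu) > \frac{\delta}{N(n)} \implies \hat\mu(n) < z . \]

Hence, on the event 

\[ \{\tilde t_{k-1} < N(n) \le t_k\} \cap \{\hat\mu(n) < \mu\} \cap \Big\{ d(\hat\mu(n); \mu) > \frac{\delta}{N(n)} \Big\} \]

it holds that 

\[ \lambda(z) \hat\mu(n) - \phi_\mu(\lambda(z)) \ge \lambda(z) z - \phi_\mu(\lambda(z)) = d(z; \mu) \ge \frac{\delta}{(1+\eta) N(n)} . \]

(Footnote to the choice of $\eta$: if $\delta \le 1$, it is easy to check that the bound holds whatsoever.) 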
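
The variational identity relating the Bernoulli divergence to $\lambda(x)$, which underlies the display above, can be verified numerically. The sketch below assumes the Bernoulli Kullback-Leibler divergence for $d$ and uses illustrative helper names (`kl`, `lam`, `legendre_form`) that are not taken from the paper.

```python
import math


def kl(x: float, mu: float) -> float:
    """Bernoulli KL divergence d(x; mu) for x, mu strictly inside (0, 1)."""
    return x * math.log(x / mu) + (1.0 - x) * math.log((1.0 - x) / (1.0 - mu))


def lam(x: float, mu: float) -> float:
    """lambda(x) = log(x (1 - mu)) - log(mu (1 - x)); negative when x < mu."""
    return math.log(x * (1.0 - mu)) - math.log(mu * (1.0 - x))


def legendre_form(x: float, mu: float) -> float:
    """lambda(x) * x - log(1 - mu + mu * exp(lambda(x)))."""
    l = lam(x, mu)
    return l * x - math.log(1.0 - mu + mu * math.exp(l))


pairs = [(0.1, 0.5), (0.3, 0.5), (0.2, 0.7), (0.69, 0.7)]
errors = [abs(kl(x, mu) - legendre_form(x, mu)) for x, mu in pairs]
signs = [lam(x, mu) for x, mu in pairs]
print(errors, signs)
```

Both expressions agree to machine precision, and $\lambda(x)$ is negative whenever $x < \mu$, which is what makes $\lambda(z)\hat\mu(n) \ge \lambda(z) z$ hold when $\hat\mu(n) < z$.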



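Putting everything together, 

\[
\begin{aligned}
\{\tilde t_{k-1} < N(n) \le t_k\} \cap \{\hat\mu(n) < \mu\} \cap \Big\{ d(\hat\mu(n); \mu) > \frac{\delta}{N(n)} \Big\}
&\subset \Big\{ \lambda(z) \hat\mu(n) - \phi_\mu(\lambda(z)) \ge \frac{\delta}{N(n)(1+\eta)} \Big\} \\
&\subset \Big\{ \lambda(z) S(n) - N(n) \phi_\mu(\lambda(z)) \ge \frac{\delta}{1+\eta} \Big\} \\
&\subset \Big\{ W_n^{\lambda(z)} \ge \exp\Big( \frac{\delta}{1+\eta} \Big) \Big\} .
\end{aligned}
\]

As $(W_t^\lambda)_{t \ge 0}$ is a supermartingale, $\mathbb{E}\big[W_n^{\lambda(z)}\big] \le \mathbb{E}\big[W_0^{\lambda(z)}\big] = 1$, and the Markov inequality yields: 

\[
\mathbb{P}\big( \{\tilde t_{k-1} < N(n) \le t_k\} \cap \{\hat\mu(n) < \mu\} \cap \{ N(n)\, d(\hat\mu(n), \mu) > \delta \} \big)
\le \mathbb{P}\Big( W_n^{\lambda(z)} \ge \exp\Big( \frac{\delta}{1+\eta} \Big) \Big)
\le \exp\Big( -\frac{\delta}{1+\eta} \Big) .
\]

Finally, by Equation (7), 

\[ \mathbb{P}\Big( \bigcup_{k=1}^{D} \{\tilde t_{k-1} < N(n) \le t_k\} \cap \{u(n) < \mu\} \Big) \le D \exp\Big( -\frac{\delta}{1+\eta} \Big) . \]

But as $\eta = 1/(\delta - 1)$, $D = \Big\lceil \frac{\log(n)}{\log(1 + 1/(\delta-1))} \Big\rceil$ and as $\log(1 + 1/(\delta - 1)) \ge 1/\delta$, we obtain: 

\[ \mathbb{P}\big(u(n) < \mu\big) \le \Big\lceil \frac{\log(n)}{\log\big(1 + \frac{1}{\delta-1}\big)} \Big\rceil \exp(-\delta + 1) \le e \lceil \delta \log(n) \rceil \exp(-\delta) . \]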
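
The arithmetic of this last step can be sanity-checked numerically. The sketch below evaluates both sides of the final inequality for a few arbitrary pairs $(n, \delta)$ with $\delta > 1$; the function name `final_step` and the chosen values are illustrative assumptions.

```python
import math


def final_step(n: int, delta: float):
    """Return (D * exp(-delta / (1 + eta)), e * ceil(delta * log n) * exp(-delta))
    with eta = 1 / (delta - 1); requires delta > 1."""
    eta = 1.0 / (delta - 1.0)
    D = math.ceil(math.log(n) / math.log(1.0 + eta))
    peeled = D * math.exp(-delta / (1.0 + eta))
    stated = math.e * math.ceil(delta * math.log(n)) * math.exp(-delta)
    return peeled, stated


cases = [(1000, 3.0), (100, 1.5), (10**6, 10.0)]
results = [final_step(n, delta) for n, delta in cases]
for (n, delta), (peeled, stated) in zip(cases, results):
    print(n, delta, peeled, stated)
```

With $\eta = 1/(\delta-1)$ the exponent $\delta/(1+\eta)$ equals $\delta - 1$, which is where the leading factor $e$ of the stated bound comes from.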



Of course, a symmetric bound for the probability of over-estimating $\mu$ can be derived following 
the same lines. Together, they show that for all $\delta > 0$: 

\[ \mathbb{P}\big( N(n)\, d(\hat\mu(n), \mu) > \delta \big) \le 2e \lceil \delta \log(n) \rceil \exp(-\delta) . \]

Finally, we state a more general deviation bound for arbitrary reward distributions with finite 
exponential moments. 

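Theorem 11 Let $(X_t)_{t \ge 1}$ be a sequence of random variables defined on a probability space 
$(\Omega, \mathcal{F}, \mathbb{P})$ with common expectation $\mu = \mathbb{E}[X_t]$. Assume that the cumulant-generating function 

\[ \phi(\lambda) = \log \mathbb{E}[\exp(\lambda X_1)] \]

is defined and finite on some open subset $]\lambda_1, \lambda_2[$ of $\mathbb{R}$ containing 0. Define the (good) rate function 
$d : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ of $X_1$ to be the Fenchel-Legendre transform of $\phi$: for all $x \in \mathbb{R}$, 

\[ d(x, \mu) = \sup_{\lambda \in ]\lambda_1, \lambda_2[} \{ \lambda x - \phi(\lambda) \} . \]

Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields of $\mathcal{F}$ such that for each $t$, $\sigma(X_1, \dots, X_t) \subset \mathcal{F}_t$ and 
for $s > t$, $X_s$ is independent from $\mathcal{F}_t$. Consider a previsible sequence $(\varepsilon_t)_{t \ge 1}$ of Bernoulli variables 
(for all $t > 0$, $\varepsilon_t$ is $\mathcal{F}_{t-1}$-measurable). Let $\delta > 0$ and for every $t \in \{1, \dots, n\}$ let 

\[ S(t) = \sum_{s=1}^{t} \varepsilon_s X_s , \qquad N(t) = \sum_{s=1}^{t} \varepsilon_s , \qquad \hat\mu(t) = \frac{S(t)}{N(t)} , \]

\[ u(n) = \operatorname{argmax}\big\{ q > \hat\mu(n) : N(n)\, d(\hat\mu(n), q) \le \delta \big\} . \]

Then 

\[ \mathbb{P}\big(u(n) < \mu\big) \le e \lceil \delta \log(n) \rceil \exp(-\delta) . \]

The proof, which is very similar to that of Theorem 10, is omitted. 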
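
As an illustration of the rate function in Theorem 11, the Fenchel-Legendre transform can be evaluated numerically and compared with a known closed form. The sketch below takes an exponential distribution with mean $\mu$, whose cumulant-generating function is $\phi(\lambda) = -\log(1 - \mu\lambda)$ for $\lambda < 1/\mu$, and recovers $x/\mu - 1 - \log(x/\mu)$. The ternary search, the lower search limit of $-50$ and the function name `rate_function` are assumptions made for this example.

```python
import math


def rate_function(x, phi, lam_lo, lam_hi, iters=200):
    """Maximise lam * x - phi(lam) over ]lam_lo, lam_hi[ by ternary search;
    valid because the objective is concave in lam."""
    lo, hi = lam_lo, lam_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * x - phi(m1) < m2 * x - phi(m2):
            lo = m1  # the maximiser lies to the right of m1
        else:
            hi = m2  # the maximiser lies to the left of m2
    lam = 0.5 * (lo + hi)
    return lam * x - phi(lam)


mu = 2.0


def phi_exp(lam):
    """Cumulant-generating function of an exponential law with mean mu."""
    return -math.log(1.0 - mu * lam)


def closed_form(x):
    return x / mu - 1.0 - math.log(x / mu)


numeric = [rate_function(x, phi_exp, -50.0, 1.0 / mu) for x in (1.0, 3.0)]
exact = [closed_form(x) for x in (1.0, 3.0)]
print(numeric, exact)
```

The numerical and closed-form values coincide, which ties the general rate function back to the divergence used for exponential rewards.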